Prediction of G4 formation in live cells with epigenetic data: a deep learning approach

Abstract G-quadruplexes (G4s) are secondary structures abundant in DNA that may play regulatory roles in cells. Despite the ubiquity of the putative G-quadruplex-forming sequences (PQS) in the human genome, only a small fraction forms G4 structures in cells. Folded G4, histone methylation and chromatin accessibility are all parts of the complex cis regulatory landscape. We propose an approach for prediction of G4 formation in cells that incorporates epigenetic and chromatin accessibility data. The novel approach termed epiG4NN efficiently predicts cell-specific G4 formation in live cells based on a local epigenomic snapshot. Our results confirm the close relationship between H3K4me3 histone methylation, chromatin accessibility and G4 structure formation. Trained on A549 cell data, epiG4NN was then able to predict G4 formation in HEK293T and K562 cell lines. We observe the dependency of model performance with different epigenetic features on the underlying experimental condition of G4 detection. We expect that this approach will contribute to the systematic understanding of correlations between structural and epigenomic feature landscape.

The first G4 prediction approaches were based on biophysical knowledge that a DNA motif with four runs of at least three guanines separated by loops of 1-7 nucleotides is likely to fold into a G4, and the first search of such patterns in the human genome resulted in around 376 000 matches ( 17 , 18 ). As the di v ersity of confirmed G4s was e xpanded, e.g. G4s with long loops ( 19 ), G4s with bulges ( 20 ) and G4s with missing guanines (21)(22)(23), algorithms for search and prediction of G4s e volv ed to accommodate the di v ersity of motifs as well. Novel whole-genome searches include inter-molecular G4s formed between the two DNA strands ( 24 ), and slightly mismatched sequences ( 25 ). Further approaches incorporated contextually enhanced prediction, such as G-runs continuity coupled with loop size ( 26 ), and nucleotide content bias ( 27 , 28 ). Recently, new experimental methods for G4 detection have emerged. A polymerase stop assay Illumina sequencing method was de v eloped that allowed to detect over 525 000 G4s in purified nuclear DNA in vitro , or more than 710 000 G4s with a G4-stabilising ligand ( 29 ) . In cellulo G4 detection methods include G4 chromatin immunoprecipitation sequencing (ChIP-seq) methods that use structure-specific antibodies to detect G4 formed in cellular cultures (30)(31)(32) and tumor xenografts ( 33 ), and G4 detection with cleavage under targets and tagmentation (CUT&Tag) technique ( 34 , 35 ). The numbers of reported G4s in cells vary greatly between the experimental methods and cell types (Table 1 ).
Methods for G4 detection in cells imply G4 detection ex vivo or in situ . The former means that the cells are fixed and chromatin is fragmented before it is enriched with a specific antibody, usually BG4 ( 30 , 31 , 36 ). The latter means that BG4 antibody permeates the plasma and nucleus membranes and tethers Tn5 transposase for tagmentation ( 35 , 38 ), or another small G4-binding protein, G4P, is expressed endo genousl y for a subsequent ChIP-seq experiment ( 32 ). Only a fraction of resulting G4s overlap in the same cell line using two different types methods --about 30-60% ( 35 ), or 45% ( 38 ). The di v ergent numbers of G4 formed in different types of cells and detected with different approaches may be due to both technical and biological variability and suggest the need for understanding the limits of technical artefacts and exploration of the biological causes.
Predicti v e models f or G4 f ormation based on these experimental data have been implemented: Quadron ( 40 ) and PENGUINN ( 41 ) trained on in vitro DNA G4  ( 31 , 35 , 37 , 38 , 42 ). Le v eraging multiple epigenomic features to predict G4 formation in cells and determining the importance of these features may provide novel insights into G4 formation mechanisms and their regulatory roles.

G4 existence in the epigenetic context
DNA G-quadruplex es ar e over-r epr esented in gene promoters and are thought to be involved in gene regulation at the transcription le v el. More than 40% of gene promoters in the human genome contain G4-forming motifs ( 44 ), and their structural properties make them attracti v e drug targets for diseases involving d ysregula tion of gene transcription ( 15 ). Folded G4s were confirmed to be highly enriched in gene promoters in cells more than in any other region ( 31 ), with some studies showing that most folded G4s are, in fact, located in the promoter regions in cells ( 35 ). Both G4s formed in cells ( 31 ) and gene promoters are associated with open chromatin ( 45 , 46 ), aiding the accessibility for transcription machinery. Open chromatin was indeed found to contain 85.8% of G4 in HaCaT cells and 97.2% in NHEK cells ( 31 ), whereas we found that for A549 cells G4 data ( 32 ), only 6.6% of G4s intersected with peaks fr om chr omatin accessibility (ATAC-seq) experiment, likely dri v en by a higher number of G4 peaks reported in A549. It has been hypothesized that G4 formation promotes transcription factor docking by keeping the DNA double helix open (47)(48)(49) and allows re-initiation of transcription.
Ther efor e, G4s ar e likely to be located in accessible chromatin regions due to the local regulatory roles, but the ways G4s are related to accessible chromatin and epigenetic marks are highly complex ( Figure 1 ). Recently, evidence of high colocalization of other epigenetic marks, such as histone modifications, with G4 formation became available ( 35 ). H3K4me3 was found to be the most correlated to G4 formation in HEK293T cells, as measured by G4 ChIP-seq, followed by H3K4me2, H3K4me1 and chromatin accessibility by ATAC-sequencing ( 35 ). Additionally, acti v e G4s are present in CpG island (CGIs) regions depleted in cytosine methylation ( 37 , 38 ) and inhibit methylation of DNA ( 37 ). CGI methyla tion pa tterns, in turn, mediate binding of specific families of transcription factors that have pr efer ence for either methylated or h ypometh ylated CGI ( 50 ), ther efor e leading to transcriptional regulation via epigenetic modifications. Another mechanism of G4 involvement in cellular processes regulation through epigenetic marking is the G4 involvement in DNA replication. It was demonstrated that G4 formed during DNA replication leads to epigenetic instability due to failure of copying the chromatin r epr essi v e mar ks ( 51 ). Additionally, G4 are known to recruit histone modification agents ( 52 , 53 ) and chr omatin remodelling pr oteins ( 54 ). The interplay between folded G4s and epigenetic marks is evident, however, it is still not sufficiently explored for G4 formation prediction. While the correlations of accessible chromatin and some epigenetic marks hint the relative importance of these features for G4 formation, our goal is to use machine learning to de v elop a model tha t can predict G4 forma tion in cells and rate the importance of these features for G4 formation.
Fi v e histone modification marks (H3K4me1, H3K4me3, H3K9me3, H3K27ac, H3K36me3) and chromatin accessi- bility (ATAC-seq) were selected for training and evaluation. We aim to predict weaker G4s formed in cells along with stable G4s. We base our predictions on a broad set of putati v e quadruplex-f orming sequences (PQS) f ound in the human genome according to the latest motif definitions (see Materials and Methods) and infer the PQS formation probability in cells. The de v eloped approach is designed to transfer the learned features to predict G4 in unseen types of cells based on the underlying sequence and local epigenetic snapshot.

G4 input pr epar ation f or epiG4NN model
PQS sequences were found with a regular expression search using python re package in the hg19 / GRCh37 human genome assemb ly, retrie v ed from the UCSC Genome Browser ( http://genome.ucsc.edu/ ). We broadened the existing definitions of G4 and used the following regular expressions: A total number of guanines greater than or equal 12 was r equir ed to allow thr ee-layer ed G4s. The search was performed on both strands. We filtered the redundant and nested G4s by only considering distinct sequences separa ted by a t least one nucleotide. The overlapping sequences were not merged. The first encountered motif from the 5 end of the overlapping group is considered. . For each PQS, the respecti v e G4 score was found from the experimental dataset by overlapping the PQS motif with experimental peaks and taking an average of the continuous .bedgraph signal. If there is no signal corresponding to the PQS coordinate, the score was set as 0.0. G4 labels for training were obtained from the G4P ChIP-seq study with GEO accession number GSE133379 for A549 and HEK293T cell lines. The upper 5 percentiles of the normalized experimental scor es wer e characterized as 'positi v e' class, and the rest of PQS as 'negati v e' (Supplementary Figure S2). As a result, > 105 000 PQS were classified as positi v e for A549 --a slightly more conservati v e number of peaks as compared to the number of peaks determined in the downstream analyses in the original study ( 32 ). A549 cell data were selected for training, and HEK293T for independent evaluation. For HEK293T, upper 2 percentiles were used to match the originally reported number of called peaks ( > 40 000). For additional evaluation and analyses, pre-processed G4 peaks for HeLa (GSE133379), HaCaT (GSE76688) and K562 (GSE107690) were used.

Chromatin accessibility and epigenetic marks data pr epar ation
Epigenetic information is generally preserved across tissues in species, especially for cell cultures with common progenitor cells ( 55 ), ther efor e, using G4 data and epigenetic data from different studies of the same cell line is possible. We used histone 3 lysine residue 4 methyla tion and trimethyla tion, histone 3 lysine 9 and lysine 36 residues trimethylation, histone 3 lysine 27 residue acetylation, and nucleosome availability as epigenetic marks for our experiments. We retrieved H3K4me1, H3K4me3, H3K9me3, H3K27ac, H3K36me3 ChIPseq datasets and ATAC-seq dataset for A549 cells, and H3K4me3, H3K27ac ChIP-seq for HEK293T from the Encyclopedia of DNA elements (ENCODE) ( 56 ) Reference Epigenome project with accession codes ENCFF633KDT, ENCFF021UDY , ENCFF474QYY , ENCFF126BYV, ENCFF053BXF, ENCFF735UWS, and accession codes ENCFF315TAU, ENCFF186KMN, respecti v ely. ATAC-seq signal for HEK293T was retrie v ed fr om GEO pr oject with accession code GSM3905877. For K562 cells, we used ENCODE projects with accession codes ENCFF252GZO for ATAC-seq, ENCFF929TPH for H3K4me3, and ENCFF488FYZ for H3K27ac. If the data were originall y ma pped to hg38, they were lifted over to hg19 using UCSC liftOver tool with a chain downloaded from the UCSC genome browser ( https: //hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/ ). We filter ed out r egions with no full coverage of any of the featur es to ensur e continuous data availability for e v ery PQS. All epigenetic data were normalized to a range of values [0, 1]. We then created input arrays of 1000-nt epigenetic profiles for each PQS with bedtools intersect command-line tool. Genomic tracks were visualized with pyGenomeT r acks tool ( 57 ).

Model ar chitectur e
epiG4NN ar chitectur e is based on a ResNet ( 58 ) --neural network class with residual convolutional layers and dilation. We designed our ar chitectur e as sequential convolutional towers of four residual blocks each, where each residual block contains ba tch normaliza tion ( ×2), rectified linear unit (ReLU) activation (x2) and convolutional layer with 32 kernels and variable kernel size ( ×2) (Supplementary Figure S3). We constructed a model with only sequence input ( G4NN ) as a baseline for epiG4NN . Model inputs are arrays of shape [1000, 4] for G4NN , sequencebased model, and arrays of shape [1000, 5] for epiG4NN . The first four rows of the input are the one-hot-encoded sequences, and the last row is the normalized epigenetic feature. Output of each residual unit is added to the penultimate layer through a 1D convolution with hidden state of size 32, ther efor e implementing skip, or r esidual, connections between units for better convergence and avoidance of vanishing / e xploding gradients prob lem ( 58 ). Classification objecti v e for this model is single-class, single label binarized probability of G4 formation. We performed a search for optimal hyperparameter set by training a grid of models with various hyperparameters. The optimal parameters were selected based on the performance of the model on the valida tion da taset during training: K = 3, W = [11,11,15], D = [1,4,10], where K is the number of conv olutional to wers, W i is the convolutional filter width in the i th residual block, D i is the dilation rate of the i th residual block. We additionally trained a similar set of models with 250-nt inputs to see whether the performance suffers significantly, and, indeed, the evaluation of the 250-nt model revealed lower AUROCs and AUPRCs (Supplementary Figure S4).

Model training and evaluation
Training data from A549 dataset were split into train and test subsets, where the test set contains all the PQS that belong to chromosomes 1, 3, 5, 7, 9, and train set -all PQS that belong to chromosomes 2, 4, 6, 8, 10-22, X, Y . The models were trained with batch size of 64 and a constant learning rate of 0.001 with Adam stochastic gradient descent optimizer method based on adapti v e estimation of lower-order moments ( 59 ). Upon initialization, model weights are filled with random numbers. At each training iteration, binary cross entropy loss function is minimized over the target train labels. Output labels are determined via sigmoid function applied to the ultimate logit.
Due to the large number of input data samples, chromosome data were input gradually and training on one chromosome from the train set constituted a full iter ation. Tr aining then continued with the next chromosome, to a total of 19 iteration steps in one epoch. Optimal hyperparameters were determined from the performance of the sequenceonly model on the validation dataset, as determined by the ar ea under pr ecision-r ecall curve (AUPRC), and applied to all the other models. Most of PQS had zero scor es, ther efore, the classes are highly imbalanced and are contributing unequally to the metrics and loss, with low probability class (unfolded G4) skewing the metrics. This problem was addressed by implementing balanced class weights for the optimizer: Respecti v e python scripts can be found at https://github.com/anyakors/ epiG4NN . For the promoter-enhancer bias assessment, gene promoter data for hg19 were downloaded from the UCSC table browser, and cell line-specific enhancer data were downloaded from the Enhancer Atlas v 2.0 database http://www.enhanceratlas.or g/do wnloadv2.php ( 60 ).

Broadening the definition of G4-forming motifs
Earl y a pproaches to defining G4-forming motifs used r egular expr essions of type [G 3+ L 1-7 ] 3+ G 3+ ( 17 , 18 , 61 ) and yielded around 376 000 putati v e G4s. Many G4s forming in vitro still did not adhere to this limited definition ( 29 ), including G4s with missing guanines ( 21-23 ), G4s with ultra-long loops ( 19 ), bulged G4s ( 20 ) and G4s with runs of irregular length ( 62 ), to name a few. Later search efforts considered variable numbers of guanines in the run ( 63 ), bulges and mismatches ('imperfections') ( 25 , 26 ), longer loops of up to 12 nucleotides ( 29 ), G-rich sequences considering cytosine bias ( 27 ), and duplex stem-containing loops ( 64 ). As the goal of this work was to define a set of G4 motifs covering a significant part of the in vivo formed peaks for training, we broadened the existing definitions of G4s and used the following r egular expr essions: [G 3+ L 1-12 ] 3+ G 3+ --canonical G4 pattern with extended loop length, similarly to ( 29 ); [GN 0-1 GN 0-1 GL 1-3 ] 3+ GN 0-1 GN 0-1 G --G4 pattern with possible bulges, with restrictions as described in ( 20 ); [G 1-2 N 1-2 ] 7+ G 1-2 --irregular G4 pattern alike those studied in ( 62 ). These three definitions do not aim to e xhausti v ely cover the G4 motif r epertoir e, and only select a r epr esentati v e set of G4 motifs for training. An alternati v e approach would be to use algorithms measuring the G-richness or G-skewness such as G4Hunter, but the definitions above cover a substantial variability of motifs for a r epr esentati v e and large sample of G4 sequences, plus different subtypes of G4s may be enriched in different regions of the genome, making interpretation of the downstream analyses easier. A total of 2 105 837 G4 motifs were determined after filtering (see Materials and Methods), where 907 845 instances are bulged G4s, 652 908 are irregular G4s, and 545 084 are canonical G4s with extended loop length.

G4s colocalize with accessible chromatin and epigenetic marks in cells
We interrogated the mean profiles of fiv e epigenetic marks and chromatin accessibility genome-wide at acti v e G4 loci in A549 cells ( 32 ) (Figure 2 ). Mean normalized signals of H3K4me1, H3K4me3, H3K9me3, H3K27ac, H3K36me3 histone modifications, and ATAC-seq signal were centered and plotted at PQS motifs with G4P ChIP-seq peaks. H3K4me3, H3K27ac and ATAC-seq profiles displayed positi v e association of epigenetic mark occupancy with the G4 peaks, while H3K4me1, H3K9me3 and H3K36me3 marks demonstrated the opposite trend. H3K4me1, H3K9me3 and H3K36me3 exhibit a dip at G4 sites compared to the genome-wide background le v el, while H3K4me3, H3K27ac and ATAC-seq demonstrate peaks at G4 sites compared to background le v el (Figure 2  Recently, similar results were reported for histone mark enrichment in mouse embryonic stem cells ( 38 ), where high confidence G4 CUT&Tag peaks overlapped with open chromatin, H3K4me3 and H3K27ac peaks, whereas H3K4me1 and H3K9me3 exhibited a local minimum.
Acti v e G4, H3K4me2 and H3K4me3 peaks followed by H3K27ac and open chromatin densely occupy gene promoter regions in HEK293T cells in another study ( 35 ). It was confirmed earlier that folded G4s are enriched at gene promoter regions characterized by open chromatin ( 31 ). Ther efor e, H3K4me3, and H3K27ac histone modifications and open chromatin (ATAC-seq) signal are good candidates to inform the G4 prediction in cells. We aim to extract the informati v e signal from both the epigenetic landscape and the PQS sequence with a neural network termed epiG4NN and compare it with the baseline G4NN that only uses the DNA sequence.

epiG4NN is a new hybrid model lever aging differ ent epigenetic marks
We de v eloped epiG4NN , a hybrid sequence and epigenetic context-based model for prediction of G4 formation in cells.
epiG4NN is a neural network model that consists of stacks of con volutional la yers with skip connections f or better training convergence ( 58 ). Convolutional networks (CNNs) are a class of deep learning methods that achie v ed significant breakthroughs in the genomic predictions (65)(66)(67)(68). Convolutional kernels slide along the inputs and extract input features, passed on to the next layers. CNNs allow for hierar chical r epr esenta tion of fea tures through learning the patterns in the input sequence without explicit feature engineering. For inputs, we used PQS and different auxiliary arrays of processed epigenetic marks aligned to PQS: histone 3 lysine 4 residue mono-methylation (H3K4me1), histone 3 lysine 4, 9 and 36 residues tri-methylation (H3K4me3, H3K9me3, H3K36me3), histone 3 lysine 27 residue acetylation (H3K27ac), and chromatin accessibility (ATAC ChIPseq data). Each epigenetic feature was tested independently (Figure 3 ). Comparisons were made with a sequence-based model G4NN with the same ar chitectur e and hyperparameters (number of residual stacks, convolutional kernel properties , learning rate , batch size). We trained our models on A549 G4 data from a RHAU-deri v ed G4-binding protein G4P-ChIP-seq experiment ( 32 ) and subsequently evaluated on both A549 withheld test samples and unseen HEK293T and K562 cell data.

Accur ate pr ediction of G4 f ormation in cells based on epigenetic features using epiG4NN
Upon optimization of hyperparameters, the best epiG4NN ar chitectur e was determined. We trained six epiG4NN models ( epiG4NN -H3K4me1,  epiG4NN -H3K4me3,  epiG4NN -H3K9me3, epiG4NN -H3K27ac, epiG4NN -H3K36me3, epiG4NN - . Of note, some of the used ENCODE da tasets (see Ma terials and Methods) have low coverage depth, as they were chosen for keeping the technical cross-laboratory variability to a minimum rather than maximizing the quality of the dataset. The model, howe v er, reached the state-of-the-art performance on these data, ther efor e we can expect it to perform e v en better on higher quality data.

Model learned from in situ data cannot predict well ex vivo data
Additionally, we tested another K562 cell line G4 dataset obtained with the ex vivo BG4 ChIP-seq method. Gi v en the same epigenetic profile and the same sequence motifs as for the K562 (CUT&Tag) data, with a different set of G4 peaks to predict, epiG4NN -H3K4me3 and epiG4NN - H3K27ac models were only able to perform marginally better than the G4NN baseline in terms of AUROC and AUPRC; epiG4NN -ATAC resulted in a slightly better evaluation (AUROC = 0.909, AUPRC = 0.342) ( Figure 6 ). The underlying technical / experimental condition difference between BG4 ChIP-seq and CUT&Tag methods of G4 detection in cells, ther efor e, makes the transfer of learned features challenging. The features learnt from one experimental condition may differ from the features in other conditions, resulting in poor cross-method model performance.

An example of differential epiG4NN prediction
The first intronic region of the human GRHL3 gene, located on the chromosome 1, contains se v eral PQS, where one PQS is formed in both HEK293T and A549 cell lines, while the other PQS is formed in HEK293T cells but not in A549 cells ( 32 ) (Figure 7 ). We created pseudo-genomic tracks that demonstrate the formation of these PQS using epiG4NN -H3K4me3 and making point predictions for each nucleotide in the region of interest.
We found se v en PQS corresponding to GRHL3 intron 1 r egion (Figur e 7 ), wher e four PQS ar e found on the plus strand (numbered 1, 3, 5 and 7), and three PQS on the minus strand (numbered 2, 4 and 6). We additionally refer to the in vitr o da ta. Interestingl y, onl y one in vitro formed G4 on the plus strand was matching the region, and it did not overlap with any of the PQS motifs. In cells, A549 G4P ChIP-seq data has one G4 peak corresponding to PQS number 3 and HEK293T has two peaks corresponding to PQS numbers 3, 5 ( Figure 7 ). epiG4NN -H3K4me3 captures the difference between the H3K4me3 features in A549 and HEK293T cell lines gi v en the same sequence of the G4 motif and predicts unique formation of the PQS number 5 in HEK293T, while PQS number 3 is predicted in both cell lines. Previously reported DeepG4 model only used G4 forming both in cells and in vitro for training and testing ( 42 ). The lack of matching in vitro peak highlights the importance of prediction in cells irrespecti v e of the G4 forma tion in vitr o .

epiG4NN -H3K4me3 exhibits a promoter and enhancer bias
Certain histone marks are known to be enriched in open chromatin, gene enhancer, and promoter regions of the genome ( 69 ). H3K4me3 is an 'acti v e' histone mark thought to play a role in transcription ( 70 , 71 ) and is marking gene promoters ( 72 ), while H3K27ac is marking gene enhancers ( 73 ). To test whether our model is biased towards these regions, we extracted G4 overlapping gene promoters ( 74 ) and cell-specific enhancers for A549 and HEK293T cell lines ( 75 ), and evaluated the region-specific performance of epiG4NN -H3K4me3. Evaluation re v ealed ov erall better AUPRC scores for enhancer and promoter regions compared to random regions ( Figure 8 ) together with the fact  HEK293T predictions with epiG4NN -H3K4me3 and G4 peaks detected with G4P ChIP-seq ( 32 ). PQS numbered 3 has a peak in both HEK293T and A549 cells, while PQS number 5 has a peak in HEK293T only. PQS number 5 is highlighted, and its sequence is shown. tha t a grea ter proportion of G4s is formed in promoter regions. Ubiquitous formation of G4s in the promoter regions was indeed experimentally confirmed previously ( 31 , 54 , 76 ). A significantly lower proportion of the G4s is formed in the enhancer regions (Figure 8 D), and e v en less so in the random r egions (Figur e 8 F), while the quality of prediction drops the most drastically for random regions in HEK293T cell lines.

DISCUSSION
Genome-wide G4 prediction methods are important for understanding G4 biology and for targeting such structures. Recent progress in chromatin immunoprecipitation methods was applied to G4 detection, and multiple experimental G4 datasets wer e r eported. Howe v er, the discrepancies between G4 formed in different cells and experiment types pointed to the need for understanding G4 formation in cells. The vast body of cellular epigenetic data allows to attribute G4 formation in cells to specific cellular features. So far, little is known about the correlations between G4 formation and such cellular da ta. Here, we demonstra te a novel ap-proach, epiG4NN, that comprises of a hybrid deep neural network that uses cellular epigenetic features and DNA sequence for G4 prediction in genomic DNA. Compared to pre viously pub lished methods, epiG4NN achie v es unprecedented precision and recall in G4 prediction in unseen cell lines. Additionally, epiG4NN allows to rate the relevance of H3K4me1, H3K4me3, H3K9me3, H3K27ac, H3K36me3 and chromatin accessibility signals to G4 prediction problem.
Through our studies on the A549, HEK293T and K562 data, we show that supplementing epigenetic data improves learning for prediction of G4 formation, as compared to sequence-based model G4NN with the same ar chitectur e and number of parameters. We demonstra te tha t epiG4NN -H3K4me3, epiG4NN -H3K27ac and epiG4NN -ATAC considerably outperform G4NN . H3K4me3 is the strongest predictor of the G4 formation in our studies on A549 and HEK293T (G4P ChIP-seq ( 32 )) and K562 cell data (CUT&Tag ( 34 )), followed by ATAC-seq and H3K27ac signals. We found that the optimal epigenetic predictor for cell lines depends on the experimental condition of the underlying da ta. Independent evalua tion of epiG4NN on G4 data obtained with G4P ChIP-seq ( 32 ) and CUT&Tag ( 34 ) showed that H3K4me3 is a good predictor of G4 formation for the G4 detected in situ , while open chromatin is the only predictor for ex vivo type of experiment (K562 data, BG4 ChIP-seq ( 37 )) that improves over the sequence-only prediction. This likely reflects the different approach behind the experimental datasets: K562 cells in BG4 ChIP-seq were fixed and chromatin was fragmented before it reacted with a G4-specific antibody, while A549 cells were subjected to a G4P knock-in and the antibody / G4-binding protein was e xpressed nati v ely. It is not clear how fragmentation and purifica tion of DNA af fects G4 forma tion, and fea tures learned by the model trained on in situ data do not seem to translate to another class of G4 detection experiment. The finding that ATAC-seq improves G4 predictions efficiency is in line with a previous report ( 42 ), while H3K4me3 was found to be highly colocalized with G4 sites in other recent studies ( 35 , 38 ). Additionally, we have demonstrated that only H3K4me3, H3K27ac and open chromatin signals contribute to a large number of acti v e G4 sites, while H3K4me1, H3K9me3 and H3K36me3 are largely depleted.
The key difference between epiG4NN and previously reported models lies in the usage and comparison of multiple epigenetic marks or features for contextual prediction of G4 formation in cells. We achie v e a better performance in G4 prediction and demonstrate relati v e importance of different epigenetic features. We additionally retrained the previously reported model DeepG4 ( 42 ) on our data to compare the model ar chitectur es, and obtained an AUROC or 0.981 and an AUPRC of 0.644 for the A549 left-out test samples (Supplementary Figure S5), demonstrating that our archi-tecture may be more suitable for this task. Additionally, unlike DeepG4 we employ a full snapshot of the local epigenetic profile, in contrast with a single average value for a gi v en G4 motif r egion. We show that our model can r eproduce peak signatures from two different cell lines, A549 and HEK293T, where a PQS is formed differentially. Our model, howe v er, suffers from a prediction accuracy bias in the random regions as compared with gene promoter or enhancer regions. We belie v e that epiG4NN can contribute to elucidation of the roles of chromatin accessibility and epigenetic marks in the sequence-structure relationship.